Image fusion is the process in which images recorded with multiple sensors are
combined into a single image. It is one of the many possibilities within the
broader field of sensor fusion. This report answers questions that arise when
fusing two kinds of images, namely visual light and thermal infrared, into a
single grey-value image. For this purpose, several image series were recorded
and processed with different methods.

An optimality criterion has been defined for fusing the images. Depending on
whether the input images are assumed to be dependent or independent, the
optimal solution to the fusion problem turns out to be either a weighted sum or
a pyramid-based solution such as the algorithm of Burt. For both methods it is
derived how the input images should be weighted to obtain an optimal result.

The image fusion results presented in this report indicate that (for the given
images) the fused image provides more information than the separate images. A
reservation must be made here: the presented images were specially selected. In
practice one of the sensors often provides considerably more information than
the other, so that using only that sensor effectively provides all relevant
information.

In addition, image fusion makes it possible to display information contained in
the infrared image in a way that matches the perception of normal visual
images.

Furthermore, the hardware with which such systems can currently be realised was
examined. It turns out that this should be possible with the newest DSP boards,
which can be placed as plug-ins in a standard PC.

A first step has been made towards performing Automatic Target Recognition
(ATR) based on multiple input image series. The results look promising.


1. Introduction


Many systems are, or could be, equipped with both a visual CCD camera and a
thermal infrared (IR) camera. The motivation is that the two cameras can
provide complementary information. When a system is equipped with such a dual
camera system, a choice must be made on how the acquired data should be
presented to the operator. Two fundamentally different methods can be used:
separate presentation and combined presentation. This report is about the latter.


1.1  Separate  presentation 

With separate presentation the data acquired with both sensors are presented
to the operator independently. In our case of two image sensors this means that
two separate images are displayed simultaneously, or that both are displayed in
turn on the same display device.

Both presentation methods have their problems. Displaying two images
simultaneously requires more display hardware. Displaying them in turn on the
same device requires intervention by the operator to switch between them. Both
methods share the problem that the operator needs to divide his attention
between the two images.

These problems mean that separate presentation of the two sensor inputs is not
ideal.


1.2  Combined  presentation 

With combined presentation, sensor fusion is applied to the sensor inputs
before they are presented to the operator. In this report we concentrate on
image fusion. This is a special kind of sensor fusion in which the output of the
fusion operation is also an image, similar to the inputs.

Ideally, the combined presentation should exhibit the following properties:

1. All relevant information present in the raw sensor data is present in the
combination.

2. The combined presentation looks natural, such that the operator does not have
to convert the output of the system to understandable items.

3. No exotic hardware is needed to calculate or visualise the combined
presentation.

As shown in the remainder of this report, these properties are not easy to achieve.
1.3 Overview

Chapter 2 describes the mathematical foundation of the image fusion processes.
Chapter 3 describes a few ideas on how to solve the same problem using a
multi-colour representation. Experimental results are given in chapter 4.
Chapter 5 describes the computational complexity of the algorithms used in
image fusion. Finally, conclusions and recommendations based upon the presented
material are given in chapter 6.
2. Fusion methods


This chapter describes the methods available to perform image fusion. Image
fusion is the process in which two input images of similar modality are
transformed into one output image of the same modality. We first discuss image
alignment and warping. After that we set an optimality criterion for image
fusion. Based upon that criterion, image blending and pyramid-based image fusion
algorithms are presented.


2.1  Image  alignment 

When two images are taken of a three-dimensional world, those images can only
overlap exactly when the optical centres of the two cameras are co-located.
When this is not the case, objects at different distances from the cameras will
show different disparities. When the optical centres cannot be co-located
(because two sensors cannot be present at the same location), a similar result
can be achieved by using dichroic mirrors. In practice this might prove
difficult; partial compensation can be achieved by applying image warping.

Besides the need to co-locate the optical centres of the cameras, the sensors
should also be aligned such that the optical axes are parallel. The pixel grids
of both imagers should also be aligned such that each pixel corresponds to
exactly the same position in both images. This can be achieved with very
similar sensors such as RGB cameras, but it will prove very hard with
dissimilar apparatus such as a visual and an IR imager. However, these errors
can be completely corrected using image warping.


2.2  Image  warping 

As shown above, a prerequisite for the image fusion algorithms is that the
images are warped. Warping is a geometric process in which the image pixels are
moved such that, after the warping operation, pixels at the same image
co-ordinates in both images refer to the same location in the real world. In
this report we use an affine transformation combined with bilinear
interpolation. Image warping, also known as geometrical transformation, is
otherwise outside the scope of this report. For a detailed description of image
warping see textbooks such as [1] chapter 17.3-17.4, and [2] chapter 5.9.

For ease of understanding, the rest of the report assumes that the images are
warped appropriately unless noted otherwise.
2.3 Optimal fusion algorithms

In this section, conditions for optimal fusion are derived. By optimal we mean
that the signal to noise ratio of the fused result is maximised. The outcome of
the derivation depends on the assumptions made; the derivation is shown for the
following two sets of assumptions:

• Identical signal in both sensors, with independent, additive noise

• Independent signal in the sensors, with independent, additive noise

2.3.1  Optimal  fusion  with  identical  signal 

We model the signal derived from sensor one as:

$$v_1 = s_1 + o_1 + n_1 \qquad (2.1)$$

with $s_1$ the signal of interest, $o_1$ a constant offset, and $n_1$ stochastic
noise. Similarly, we have for sensor two:

$$v_2 = s_2 + o_2 + n_2 \qquad (2.2)$$

As we assume that the two signals $s_1$ and $s_2$ are identical up to a scale
factor, we can write:

$$s_2 = a \cdot s_1 \qquad (2.3)$$

We assume that the noise is zero mean white noise; then we can write:

$$\sigma_{n_2} = b \cdot \sigma_{n_1} \qquad (2.4)$$

with $\sigma$ the standard deviation of the noise. We look for a solution of the type:

$$v = v_1 + c \cdot v_2 \qquad (2.5)$$

which under the assumption of independent noise leads to:

$$v = s_1 (1 + c a) + n_1 \sqrt{1 + c^2 b^2} + o_1 + c \cdot o_2 \qquad (2.6)$$


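Using formula (2.6), we can give the signal to noise ratio (SNR):

$$\mathrm{SNR} = \frac{s_1}{n_1} \cdot \frac{1 + c a}{\sqrt{1 + c^2 b^2}} \qquad (2.7)$$

Our optimality criterion is to maximise this SNR; this means finding the optimal
$c$ given $a$ and $b$:

$$\max_c \frac{s_1}{n_1} \cdot \frac{1 + c a}{\sqrt{1 + c^2 b^2}} \;\simeq\; \max_c \frac{1 + c a}{\sqrt{1 + c^2 b^2}} \;\simeq\; \max_c \frac{1 + 2 c a + c^2 a^2}{1 + c^2 b^2} \qquad (2.8)$$

We can find the extrema of this expression by calculating the derivative with
respect to $c$ and setting it to zero; this leads to:

$$c = -\frac{1}{a} \quad \vee \quad c = \frac{a}{b^2} \qquad (2.9)$$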
The first solution is a minimum, which is not of interest. The second solution
is the maximum we are looking for. The meaning of this solution is that one
should compensate for the multiplication factor $a$ and apply a noise weighting
factor $1/b^2$. An application of this derivation can be found in section 2.4.
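The maximum described above (compensating for $a$ and weighting by $1/b^2$, that is $c = a/b^2$) can be checked numerically. The following is a minimal sketch, not part of the original work: the function names, the search range and the values of `a` and `b` are hypothetical, and a plain grid search stands in for the analytical derivation.

```python
import math


def snr_gain(c, a, b):
    """Relative SNR of v = v1 + c*v2 as in formula (2.7), without the s1/n1 factor."""
    return (1.0 + c * a) / math.sqrt(1.0 + (c * b) ** 2)


def best_c_by_search(a, b, c_max=20.0, steps=200001):
    """Brute-force search for the weight c in [0, c_max] with the highest SNR."""
    best_c, best_gain = 0.0, snr_gain(0.0, a, b)
    for k in range(steps):
        c = c_max * k / (steps - 1)
        gain = snr_gain(c, a, b)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c


a, b = 0.8, 0.5                 # hypothetical signal ratio and noise ratio
c_closed_form = a / b ** 2      # compensate for a, weight by 1/b^2
c_numeric = best_c_by_search(a, b)
```

With these illustrative values the grid search should land next to the closed-form weight, and the SNR at that weight exceeds the SNR of using sensor one alone (`c = 0`).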


2.3.2  Optimal  fusion  with  independent  signal 

Using similar definitions as in the previous section, under the assumption of
independent signals for both sensors we have a different relation between the
two signals. Here we model the signal by its standard deviation; the relation
between the signals of the two sensors is:

$$\sigma_{s_2} = \sqrt{a} \cdot \sigma_{s_1} \qquad (2.10)$$

For the noise in both sensors we use:

$$\sigma_{n_2} = \sqrt{b} \cdot \sigma_{n_1} \qquad (2.11)$$

Looking for a solution according to formula (2.5), we get:

$$v = s_1 \sqrt{1 + a c} + n_1 \sqrt{1 + b c} + o_1 + c \cdot o_2 \qquad (2.12)$$

Again, we choose $c$ such that we optimise the SNR:

$$\max_c \frac{s_1 \sqrt{1 + a c}}{n_1 \sqrt{1 + b c}} \;\simeq\; \max_c \frac{\sqrt{1 + a c}}{\sqrt{1 + b c}} \;\simeq\; \max_c \frac{1 + a c}{1 + b c} \qquad (2.13)$$

Using the constraint that $a, b, c \geq 0$, we obtain:

$$c_{opt} = \begin{cases} c = 0 & | \; a < b \\ c = \infty & | \; a > b \end{cases} \qquad (2.14)$$

which essentially means: use the image with the best signal to noise ratio.
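The step from the ratio $(1 + ac)/(1 + bc)$ to this all-or-nothing choice can be made explicit by looking at the sign of the derivative. This is a short worked sketch added for illustration, using only the symbols defined above:

```latex
\frac{d}{dc}\,\frac{1 + a c}{1 + b c}
  = \frac{a\,(1 + b c) - b\,(1 + a c)}{(1 + b c)^2}
  = \frac{a - b}{(1 + b c)^2}
```

For $c \geq 0$ the denominator is positive, so the sign of the derivative is the sign of $a - b$ for every $c$. When $a < b$ the ratio only decreases and the best value is at $c = 0$; when $a > b$ it only increases and the supremum is approached for $c \to \infty$.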


2.4  Blending 


Blending is the process where we use a weighted summation of the two sensor
inputs to obtain the output:

$$v = f \cdot v_1 + (1 - f) \cdot v_2 \qquad (2.15)$$

which is similar to formula (2.5) using $f = 1/(c + 1)$. Under the assumption of
identical signal, we should use the second term of formula (2.9), which leads to:

$$f = \frac{b^2}{a + b^2} \qquad (2.16)$$

with $a$ the signal ratio between the two inputs and $b^2$ the ratio of the
variance of the noise between the two inputs.
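As an illustration of the blending rule, the sketch below computes the weight from the signal ratio and noise ratio and applies it pixel by pixel. It is a minimal example under assumptions: images are plain nested lists, and the function names and pixel values are hypothetical.

```python
def blend_weight(a, b):
    """Blending weight f = b^2 / (a + b^2), as in formula (2.16)."""
    return b ** 2 / (a + b ** 2)


def blend(v1, v2, f):
    """Pixel-wise weighted sum f*v1 + (1-f)*v2 of two equally sized images."""
    return [[f * p1 + (1.0 - f) * p2 for p1, p2 in zip(row1, row2)]
            for row1, row2 in zip(v1, v2)]


v1 = [[10.0, 20.0], [30.0, 40.0]]     # hypothetical image from sensor one
v2 = [[20.0, 20.0], [10.0, 0.0]]      # hypothetical image from sensor two

f_equal = blend_weight(a=1.0, b=1.0)  # equal signal and noise: plain average
f_noisy = blend_weight(a=1.0, b=2.0)  # sensor two twice as noisy: favour v1
fused = blend(v1, v2, f_equal)
```

When sensor two has twice the noise standard deviation of sensor one (`b = 2`), the weight of sensor one rises from 0.5 to 0.8, which matches the intuition that the noisier input should contribute less.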
2.5 Burt method


With the assumption of independent signals, the image with the best SNR should
be chosen. However, suppose we can decompose both images into independent
parts. In that case, the result derived in section 2.3.2 can be applied to these
independent parts. The algorithm developed by Burt [3] performs this with a
pyramid decomposition of the image as depicted in Figure 2.1. At each pyramid
level $k$, the Burt algorithm separates the low and the high frequency
components of the image $I_k$:

$$L_k = \mathrm{L}[I_k] \qquad (2.17)$$

$$H_k = I_k - L_k \qquad (2.18)$$

with L[] the convolution operator with a separable kernel with kernel elements
$\left[\tfrac{1}{4} - \tfrac{a}{2},\; \tfrac{1}{4},\; a,\; \tfrac{1}{4},\; \tfrac{1}{4} - \tfrac{a}{2}\right]$,
where we used $a = \tfrac{3}{8}$. The next level is a factor 2 reduced version
of the previous low frequency component:

$$I_{k+1} = \mathrm{R}[L_k] \qquad (2.19)$$

Figure 2.1: Pyramid decomposition (low frequency components and high frequency
components at each level).
with R[] the reduction operator. As the new image level is at a different
scale, the definitions of high and low frequencies have changed; this means
that the separation into frequency components can be applied again. The result
of the pyramid decomposition is a set of high frequency component images plus
the highest level low frequency component:

$$P = \{H_0, H_1, \ldots, H_n, L_n\} \qquad (2.20)$$

with $n$ the highest pyramid level. The total amount of memory needed to store
the pyramid decomposition of an image is slightly (~1.3 times) larger than that
of the original image. Each high frequency component $H_k$ represents the
information present in the image at scale $k$; every pixel $(i,j)$ within this
image represents the local structure of scale $k$ at its location. Under the
(not completely realistic) assumptions that the low pass filtering applied in
formula (2.17) is an ideal low pass filter at the Nyquist rate of the reduced
next level, and that the input image is band-limited at the Nyquist rate, the
components $H_k$ can be seen as frequency bands of the original image.
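The factor of about 1.3 follows from a geometric series. As a short worked sketch, assuming that each reduction halves the image in both directions so that level $k$ holds $4^{-k}$ times the pixels of the original:

```latex
\sum_{k=0}^{\infty} \left(\tfrac{1}{4}\right)^{k}
  = \frac{1}{1 - \tfrac{1}{4}}
  = \tfrac{4}{3} \approx 1.33
```

A pyramid with a finite number of levels stays just below this bound.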


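Using the pyramid decomposition, the Burt algorithm constructs the output image
as follows. In the formulas below, the subscripts 1 and 2 denote the two input
images and the subscript F the fused result. First, combine the highest level
$n$ by blending:

$$L_{F,n} = f \cdot L_{1,n} + (1 - f) \cdot L_{2,n} \qquad (2.21)$$

Then, for each level, construct the fused high frequency component:

$$H_{F,k}(i,j) = \begin{cases} H_{1,k}(i,j) & | \; |H_{1,k}(i,j)| > |H_{2,k}(i,j)| \\ H_{2,k}(i,j) & | \; |H_{1,k}(i,j)| < |H_{2,k}(i,j)| \end{cases} \qquad (2.22)$$

which is performed for each pixel. The fused image level is reconstructed by:

$$I_{F,k} = L_{F,k} + H_{F,k} \qquad (2.23)$$

The previous level low frequency image is constructed by:

$$L_{F,k-1} = \mathrm{E}[I_{F,k}] \qquad (2.24)$$

with E[] the expansion operator. This is repeated until the fused image level 0
is formed.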


2.5.1  Reduce-expand  operation 

Actually, the operation described above is not completely accurate. It relies
on the assumption that the result of E[R[I]] equals I. The problem is that the
expansion operator is implemented as single pixel replication with zero
filling, followed by low pass filtering with the L[] operator. This is a
non-ideal low pass filter, which results in artefacts. The solution to this
problem is to replace formula (2.18) with:

$$H_k = I_k - \mathrm{E}[\mathrm{R}[L_k]] \qquad (2.25)$$
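The effect of the reduce-expand definition can be illustrated in one dimension: because each high frequency component is the difference with the expanded next level, adding it back reproduces the level exactly, however imperfect the filter is. The sketch below is an illustration under assumptions, not the implementation used for the report: it works on 1-D lists, assumes the 5-tap binomial kernel (1, 4, 6, 4, 1)/16, replicates border samples, and keeps the top-level image itself as the low frequency residual.

```python
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]  # assumed 5-tap low pass kernel


def low_pass(signal, gain=1.0):
    """L[]: convolve with KERNEL, replicating the border samples."""
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)
            acc += w * signal[j]
        out.append(gain * acc)
    return out


def reduce_half(signal):
    """R[]: keep every second sample."""
    return signal[::2]


def expand(signal, length):
    """E[]: zero filling followed by low pass filtering (gain 2 keeps the mean)."""
    up = [0.0] * length
    up[::2] = signal
    return low_pass(up, gain=2.0)


def decompose(image, levels):
    """Return ([H_0, ..., H_{levels-1}], top level image)."""
    highs = []
    current = list(image)
    for _ in range(levels):
        nxt = reduce_half(low_pass(current))          # formulas (2.17), (2.19)
        expanded = expand(nxt, len(current))
        highs.append([c - e for c, e in zip(current, expanded)])  # formula (2.25)
        current = nxt
    return highs, current


def reconstruct(highs, top):
    """Rebuild level 0 by repeatedly expanding and adding the high frequencies."""
    current = top
    for high in reversed(highs):
        expanded = expand(current, len(high))
        current = [h + e for h, e in zip(high, expanded)]
    return current


image = [3.0, 5.0, 4.0, 8.0, 9.0, 7.0, 2.0, 1.0]   # hypothetical 1-D test signal
highs, top = decompose(image, levels=2)
restored = reconstruct(highs, top)
```

In this sketch `restored` matches `image` up to floating point rounding, which is the property that the corrected formula is meant to guarantee.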
2.5.2 The optimality criterion applied to the Burt algorithm

In the Burt algorithm as described above, no use has been made of the
optimality constraint obtained in section 2.3.2 for independent sensor signals.
If we assume that the pixel value $H_k(i,j)$ represents the local signal near
$(i,j)$ for scale $k$, we should change formula (2.22) to:

$$H_{F,k}(i,j) = \begin{cases} H_{1,k}(i,j) & | \; |H_{1,k}(i,j)| > \dfrac{|H_{2,k}(i,j)|}{\sqrt{b}} \\[2ex] \dfrac{H_{2,k}(i,j)}{\sqrt{b}} & | \; |H_{1,k}(i,j)| < \dfrac{|H_{2,k}(i,j)|}{\sqrt{b}} \end{cases} \qquad (2.26)$$

with $b$ as used in formula (2.11), and the subscripts 1, 2 and F denoting the
two inputs and the fused result. This can easily be realised by scaling the
images to their expected noise levels first; this is done by dividing image two
by $\sqrt{b}$ before starting the Burt algorithm.
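The per-pixel selection, with and without the noise scaling, might be sketched as follows. This is an illustrative example only: the function name and the pixel values are hypothetical, images are nested lists, and ties are resolved in favour of image two as an arbitrary choice.

```python
import math


def fuse_high(h1, h2, b=1.0):
    """Select per pixel the value with the largest magnitude.

    Image two is first divided by sqrt(b); b = 1 gives the plain selection rule.
    """
    scale = 1.0 / math.sqrt(b)
    fused = []
    for row1, row2 in zip(h1, h2):
        fused_row = []
        for p1, p2 in zip(row1, row2):
            p2_scaled = p2 * scale
            fused_row.append(p1 if abs(p1) > abs(p2_scaled) else p2_scaled)
        fused.append(fused_row)
    return fused


h1 = [[4.0, -1.0], [0.5, 3.0]]     # hypothetical high frequency level, image one
h2 = [[-6.0, 2.0], [0.2, -8.0]]    # hypothetical high frequency level, image two

plain = fuse_high(h1, h2)          # selection without noise scaling
weighted = fuse_high(h1, h2, b=4.0)  # image two has four times the noise variance
```

With `b = 4` the values of image two are halved before the comparison, so the strong but noisy responses of image two no longer dominate the top-left pixel.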


2.6  Toet  method 


The Toet algorithm [4] is very similar to the Burt algorithm. The difference is
that the ratio of the low pass images is used instead of the difference. In
formulas, this means replacing formula (2.25) by:

$$Q_k = \frac{I_k}{\mathrm{E}[\mathrm{R}[L_k]]} \qquad (2.27)$$

and formula (2.23) by:

$$I_k = L_k \cdot Q_k \qquad (2.28)$$

$Q$ is related to the local contrast. It can be argued that using this ratio is
appropriate when signals are not additive but multiplicative.


2.7  Wavelet  methods 

Wavelet methods are also a way to decompose an image into localised,
scale-specific signals [5][6][7]. Within the family of wavelet methods many
different decompositions of an image are possible. In fact, the Burt algorithm
can be seen as a wavelet method.

An initial study into this subject [7] showed that, within the wavelets
examined, none performed better than the Burt algorithm. It is likely that the
best choice of wavelet depends on the application at hand. Based on current
knowledge it is probably safe to state that for general applications no wavelet
performs significantly better than the Burt algorithm.
3. Multi-colour image presentation


The  image  fusion  methods  described  in  the  previous  chapter  worked  under  the 
condition  that  the  output  image  is  of  the  same  modality  as  the  input  images.  In  this 
chapter  we  will  discuss  methods  where  the  result  of  two  monochrome  input  images 
is  a  colour  image. 


3.1  False  colour  presentation 

The idea of false colour presentation is that the two input images are both
assigned to the colour bands of the output image. For this, we should define
three functions:

$$r(i,j) = f_r(v_1(i,j), v_2(i,j))$$
$$g(i,j) = f_g(v_1(i,j), v_2(i,j)) \qquad (3.1)$$
$$b(i,j) = f_b(v_1(i,j), v_2(i,j))$$

where the triplet $r, g, b$ is used as the red, green and blue values of the
pixel involved. Ideally, the colour mapping functions should be chosen such that:

1. Interesting objects (humans, vehicles, missiles, ...) contrast with their
background.

2. Objects are represented in the same colour as perceived in real life; the sky
should be blue, vegetation green, etc.

The  first  item  has  as  a  consequence  that  the  mapping  scheme  used  is  application 
dependent.  Also,  it  means  that  some  knowledge  about  the  interesting  objects 
should  be  used  to  be  able  to  let  these  objects  stand  out. 

The  second  item  is  quite  easy  to  accomplish  for  one  weather  condition  at  a  certain 
time  of  the  day.  However,  it  is  not  easy  to  satisfy  this  requirement  for  all  possible 
weather  conditions. 

The results shown in the next chapter were obtained with a simplified version
of the MIT colour image fusion scheme described in [8]. The simplification is
that the image enhancement used in the original is not applied. The specific
mapping function used is:

$$r(i,j) = \frac{v_1(i,j) + v_2(i,j)}{\ldots}, \qquad g(i,j) = \ldots, \qquad b(i,j) = \ldots \qquad (3.2)$$

where $v_1$ is the visual image, $v_2$ is the IR image, and CLIP[] an operator
which sets negative values to zero.
3.2 Target cueing

In the previous section the remark was made that object knowledge is needed to
obtain good contrast between foreground objects and the background. Target
cueing takes this a step further: the system first autonomously recognises the
interesting objects (further called targets), after which they are projected or
exaggerated in the image. An example of this technique is to visualise hotspots
in the IR image as red dots in the visual image.


Figure 3.1: Target cueing algorithm used. (Block diagram: the input IR image
and the input visual image each pass through "Subtract mean" and "Threshold to
detect objects", followed by "Find corresponding objects in other image".)

Figure 3.1 shows the block diagram of the target cueing algorithm used for the
results presented in the next chapter. First, the temporal mean is subtracted
from both images. For real-time results, this should be a moving average; for
the presented results the mean over the whole sequence is used. Second, both
images are thresholded with an image-specific threshold. For those pixels which
are candidate objects, we search the spatial neighbourhood for a corresponding
object in the other image. If one is found, the blob is labelled as BOTH.
Otherwise, depending on whether the blob occurred in the visual or the IR image,
the blob is labelled as VISUAL or IR. Each blob found is depicted as a cross in
the results given in the next chapter. BOTH, VISUAL and IR are colour coded as
red, blue and green respectively.
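The steps of this algorithm can be sketched in a few functions. The example below is a simplified illustration and not the code used for the report: it labels individual pixels instead of connected blobs, thresholds the absolute deviation from the temporal mean, and uses a square search window. All names, the frame contents, the threshold and the window radius are hypothetical.

```python
def temporal_mean(frames):
    """Per-pixel mean over a sequence of equally sized frames."""
    count = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(frame[r][c] for frame in frames) / count for c in range(cols)]
            for r in range(rows)]


def detect(frame, mean, threshold):
    """Pixels whose absolute deviation from the temporal mean exceeds threshold."""
    return {(r, c)
            for r, row in enumerate(frame)
            for c, value in enumerate(row)
            if abs(value - mean[r][c]) > threshold}


def has_neighbour(pixel, others, radius):
    """True if any pixel in others lies within a square window around pixel."""
    r, c = pixel
    return any(abs(r - r2) <= radius and abs(c - c2) <= radius
               for r2, c2 in others)


def cue_targets(ir_hits, vis_hits, radius=1):
    """Label every detection as BOTH, IR or VISUAL."""
    labels = {}
    for pixel in ir_hits:
        labels[pixel] = "BOTH" if has_neighbour(pixel, vis_hits, radius) else "IR"
    for pixel in vis_hits:
        labels[pixel] = "BOTH" if has_neighbour(pixel, ir_hits, radius) else "VISUAL"
    return labels


def frame_with(spots, size=4, level=30.0):
    """Hypothetical test frame: zeros with a few bright pixels."""
    frame = [[0.0] * size for _ in range(size)]
    for r, c in spots:
        frame[r][c] = level
    return frame


ir_frames = [frame_with([]), frame_with([]), frame_with([(0, 0), (2, 2)])]
vis_frames = [frame_with([]), frame_with([]), frame_with([(2, 3), (0, 3)])]

ir_hits = detect(ir_frames[-1], temporal_mean(ir_frames), threshold=15.0)
vis_hits = detect(vis_frames[-1], temporal_mean(vis_frames), threshold=15.0)
labels = cue_targets(ir_hits, vis_hits)
```

In this toy sequence the IR detection at (2, 2) and the visual detection at (2, 3) are adjacent and are therefore both labelled BOTH, while the isolated detections keep the label of their own sensor.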
4. Results


In this chapter we show results for the fusl3, msOl and UNcamp sequences. The
fusl3 sequence was recorded near TNO-FEL, looking at a road with passing
(civilian) vehicles. The msOl sequence, provided by Defense Research
Establishment Valcartier [9], shows a vehicle and a helicopter over a
battlefield with humans and a smoke field blocking visual imaging. The UNcamp
sequence, provided by TNO-TM [8], shows a scene representative of situations
found when guarding a UN camp. In fact, this sequence was recorded from the
TNO-FEL tower, just like the fusl3 sequence.


4.1  fusl3  sequence 


Figure 4.1: Image 21 from the fusl3 sequence (panels include "false colour" and
"target recognition").

Figure 4.1 depicts the results obtained for image 21 of the fusl3 scene. Only a
small part of the full frame is shown. The IR image is strong in showing the
left vehicle, and the visual image is strong in showing the right vehicle. All
fusion methods are capable of emphasising both vehicles. The target recognition
image, shown bottom right, marks objects found in both channels in red, and
alarms found only in the IR or only in the visual channel in green and blue
respectively. For the full sequence, using sensor fusion reduced the false
alarm rate from 88% (IR) and 161% (visual) to 10% for the combination. Full
details of the target recognition experiment can be found in appendix A. The
false colour image shows the trees as red; the fact that they are not green is
probably related to the time of day at which the image was taken, early in the
morning.
Figure 4.2 shows image 73 from the msOl sequence. The IR image shows the
helicopter, vehicle and men, whereas the visual image depicts the helicopter,
vehicle and the smoke. Also, the mountain in the background is much clearer in
the visual image. The fused images show all interesting details. Among the
fused images, the exact shape of the vehicle is best recognisable in the false
colour image.


4.3  UNcamp  sequence 


Figure 4.3: Image 1 from the UNcamp sequence (panels include "false colour").

Figure 4.3 shows the first image of the UNcamp sequence. The IR image depicts
the man, and the visual image the gate. In the fused images both the gate and
the man are visible. This image is also used in a recent report of the TNO Human
Factors Research Institute by Toet, IJspeert and van Dorresteijn [8]. The
‘blend’ and ‘false colour’ images in this report are quite similar to the images
obtained with the MIT method given in that report, although the contrast
enhancement utilised by the MIT method gives brighter results.


4.4  Temporal  noise 

Not visible in the separate images, and not easy to visualise in a report like
this, is the nature of the temporal noise present in the sequences. The noise
behaviour of both the blend images and the Burt images has been studied. When
characterised by its standard deviation, the temporal noise of the two methods
differs little. However, visual inspection clearly shows a difference in
temporal noise between the blend method and the Burt method. In some parts of
the image the sequence produced by the Burt method seems to flicker, which is
annoying to human perception.

The reason for this flicker lies in the nature of the Burt method. In equation
(2.22) and equation (2.26) we select the IR or the visual signal depending on
which has the highest absolute value. The absolute value depends on the
temporal noise in the channel, and thus small temporal noise changes can cause
the IR and the visual signal to be selected alternately in successive frames of
a sequence. When, for the spatial frequency band under consideration, the IR
and visual signals have opposite signs, this change of selection results in
flicker, visible when viewing the sequence.
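The mechanism can be reproduced with a single pixel of one frequency band. The sketch below is a toy illustration with made-up numbers, not measured data: the IR value hovers around +1 with a few percent of temporal noise, the visual value is fixed at -1, and the outputs of the selection rule and of a 50/50 blend are compared over six frames.

```python
def select_max(h1, h2):
    """Selection rule for one pixel: keep the value with the largest magnitude."""
    return h1 if abs(h1) > abs(h2) else h2


def blend(h1, h2, f=0.5):
    """Weighted sum of the two values."""
    return f * h1 + (1.0 - f) * h2


def peak_to_peak(series):
    """Difference between the largest and the smallest value of a series."""
    return max(series) - min(series)


# Hypothetical band-pass values at one pixel over six consecutive frames.
ir_series = [1.02, 0.98, 1.03, 0.97, 1.01, 0.99]   # positive, small temporal noise
vis_series = [-1.00] * 6                            # opposite sign, similar magnitude

selected = [select_max(ir, vis) for ir, vis in zip(ir_series, vis_series)]
blended = [blend(ir, vis) for ir, vis in zip(ir_series, vis_series)]

sign_changes = sum(1 for p, q in zip(selected, selected[1:]) if p * q < 0)
```

In this example the selected value jumps between roughly +1 and -1 from frame to frame, while the blended value stays within a few hundredths of zero. Noise of a few percent in the input is thus turned into a full-amplitude change in the output, which is perceived as flicker.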

Separate  from  this  report  a  CD  ROM  will  be  produced  which  contains  the 
complete  sequences  mentioned  in  this  report  and  the  processing  results.  This  will 
give  an  idea  of  the  temporal  behaviour  of  the  methods  discussed  in  this  report. 


4.5  Scenario  matching 

This chapter presents some image sequences in which image fusion improves the
perception of the imaged scene. However, it must be kept in mind that these
image sequences were specially selected for this report. Under normal
operational circumstances one of the two sensors (often the IR camera) will
give a superior image compared to the other sensor. If this sensor’s image
contains all information about the scene relevant to the current scenario,
including data from the other sensor will not add information for the operator.
This leads to the question of which sensor or image fusion scheme must be used
for a given scenario and weather conditions.
4.6 Visual appearance

In the preceding part of this chapter, the evaluation of the images was based
on information content only. However, under operational circumstances the
following are also essential:

• Only relevant information should be presented, or it should at least be
emphasised.

• The resulting image should be easily interpretable for the user.

In general, the first point requires information about the current task in
order to separate relevant from irrelevant information. In the results presented
this has been done with the target recognition approach, where the possible
targets are emphasised with a red marker. For a user who is not well trained in
the recognition of IR images, the fused images are easier to interpret. The
false colour images in particular are quite close to normal perception.
5. Computational aspects of image fusion

For the research described in this report, TNO-FEL obtained the Sensar VFE-100
image processor in 1994. The VFE-100 turned out to be very useful for getting a
feel for image fusion. The image processor enables real-time processing for
this kind of problem, and as such proved to be very valuable. Although the
VFE-100 is capable of running most of the algorithms described in this report,
it was not used in this research for the following reasons:

• The VFE-100 is designed to do all the processing from input to display.
Because digital I/O is not reliably possible, it is very hard to use such a
machine in a test-and-improve environment.

• Programming a machine specially designed for a specific task is much harder
than programming a general purpose computer such as a current workstation.
Since research on these topics includes a lot of trial and error, the
additional effort needed to program the VFE-100 did not seem worthwhile.

• The display of the VFE-100 is not capable of displaying images in colour.

• Many of the algorithms applied need floating point numbers. The VFE-100 can
only handle 8 bit integers. This can result in unwanted rounding errors.

Similar problems will also arise in the future. To overcome these problems a
fast implementation on a general purpose processor is preferable. With the
increase of processor speed, realisation on generic computer architectures
should be possible in the near future. The remainder of this chapter will show
that current plug-in DSP boards are capable of tasks such as real-time image
fusion.

The  computational  aspects  of  image  fusion  can  be  divided  into  two  parts: 

•  computations  needed  for  image  warping 

•  computations  needed  for  image  fusion 

These parts are discussed separately. In these discussions we focus on the
floating point operations needed, based upon the assumption that floating point
operations are much more expensive than integer operations.


5.1  Computing  image  warping 

The computations needed for image warping consist of two parts: computing the
location of the output pixel in input image co-ordinates, and computing the
interpolated value at this input image co-ordinate. Bilinear interpolation
seems adequate; however, to save computations one might consider zero order
interpolation.

Image warping cannot be implemented in a standard image pipeline: one cannot
tell beforehand in which order the input pixels are needed, and thus the pixels
cannot be processed sequentially from input to output. This means that a frame
store must be used to store an intermediate image.


5.1.1  Affine  transformation 

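The computation of the location of the output pixel in input co-ordinate space
is done by an affine transformation; the formula for an affine transformation is:

$$\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} t_1 \\ t_2 \end{pmatrix} \qquad (5.1)$$

with $(i, j)$ the output pixel co-ordinates and $(x, y)$ the corresponding input
image co-ordinates, which comes to four multiplications and six additions for
each pixel. Alternatively, one could use the fact that every pixel on each image
row must be processed; in that case the equation becomes:

$$\begin{pmatrix} x[n+1] \\ y[n+1] \end{pmatrix} = \begin{pmatrix} x[n] \\ y[n] \end{pmatrix} + \begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix} \qquad (5.2)$$

with $n$ the pixel index along the image row, which takes 2 additions for each
pixel.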
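The saving of the row-wise update can be illustrated by computing the input co-ordinates of one output row in both ways. This is a minimal sketch under assumptions: the translation terms `t1` and `t2`, the parameter values and the function names are hypothetical, and the column index is taken to run along the image row.

```python
def affine_direct(i, j, a11, a12, a21, a22, t1, t2):
    """Input co-ordinates of output pixel (i, j) by the full transformation."""
    return a11 * i + a12 * j + t1, a21 * i + a22 * j + t2


def affine_row_incremental(j, width, a11, a12, a21, a22, t1, t2):
    """Input co-ordinates for a whole output row using two additions per pixel."""
    x, y = affine_direct(0, j, a11, a12, a21, a22, t1, t2)  # once per row
    coords = []
    for _ in range(width):
        coords.append((x, y))
        x += a11   # step one pixel along the row
        y += a21
    return coords


params = dict(a11=1.0, a12=0.25, a21=-0.25, a22=1.0, t1=2.0, t2=0.5)
row = affine_row_incremental(j=3, width=5, **params)
direct = [affine_direct(i, 3, **params) for i in range(5)]
```

Both lists hold the same co-ordinates. In floating point arithmetic the incremental form accumulates rounding errors along a row, so restarting from the full transformation at the beginning of each row, as done here, keeps the error bounded.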


5.1.2  Bilinear  interpolation 

Bilinear  interpolation  is  specified  by  the  formulas: 


ix  =  floor[x]  (5.3) 

iy  =  floor[y]  (5.4) 

fx  =  x-ix  (5.5) 

fxinv  =  l  —  fx  (5.6) 

jy  =  y-iy  (5.7) 

tl  =  im[ix, iy}  ■  fxinv  +  im[ix  +  1,  iy]  ■  fx  (5.8) 


t2  =  im[ix,iy  -I- 1]  •  fxinv  4-  imlix  +  l,iy  +  1]-  fx 


(5.9) 


o  =  t\-(l-fy)  +  t2-fy  (5.10) 

This  comes  down  to  first  doing  a  linear  interpolation  in  the  pixel  row  above  and 
below  the  point,  followed  by  a  linear  interpolation  between  the  values  of  those 
rows. 

The computational complexity of this process is six multiplications and seven
additions/subtractions for each pixel, plus four indirections into a
two-dimensional array to access the image data used.
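
The formulas above translate almost line by line into code. The following is a minimal sketch, assuming the image is stored as a nested list indexed as im[ix][iy]; the function name `bilinear` and the 2x2 test image are hypothetical, and no bounds checking is done, so the caller has to keep ix + 1 and iy + 1 inside the image.

```python
import math


def bilinear(im, x, y):
    """Interpolate image `im` (indexed im[ix][iy]) at the real-valued position (x, y)."""
    ix = math.floor(x)
    iy = math.floor(y)
    fx = x - ix
    fxinv = 1.0 - fx
    fy = y - iy
    # Linear interpolation along x on the two neighbouring lines iy and iy + 1.
    t1 = im[ix][iy] * fxinv + im[ix + 1][iy] * fx
    t2 = im[ix][iy + 1] * fxinv + im[ix + 1][iy + 1] * fx
    # Linear interpolation along y between the two intermediate values.
    return t1 * (1.0 - fy) + t2 * fy


# Tiny hypothetical 2x2 image, indexed as im[ix][iy].
im = [[10.0, 30.0],
      [20.0, 40.0]]
centre = bilinear(im, 0.5, 0.5)
```

The body contains the six multiplications and the four image accesses counted in the text; zero-order interpolation would replace it by a single access at the rounded position.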


5.2 Computations needed for image fusion

This section gives the computational complexity of the Burt algorithm, which is
typical for image pyramid approaches.

Within the pyramids, most calculations (about 75%) are needed in the first pyramid
level. We estimate the number of calculations needed for that level, and
derive the total for the whole pyramid from that number.

The most expensive part of the Burt algorithm is the low pass filtering. It is
performed both in the low pass filtering step itself, given in formula (2.17), and in
the expansion. As reduce-expand has to be used, each level requires three
expand operations: two in pyramid formation, and one in forming the output image
from the two pyramids. In addition, there are two low pass filtering operations for
the low pass filtering step itself. This adds up to five low pass filtering operations
per level.

Each low pass filtering operation consists of a separable convolution with a
symmetric filter of width 5. This gives 3 multiplications and 5 additions per pixel.
So, for all the low pass filtering operations in the first pyramid level we need
15 multiplications and 25 additions per pixel; as the first level accounts for about
75% of the work, this adds up to about 20 multiplications and 33 additions per
pixel for all pyramid levels.
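
The step from the first-level count to the count for the whole pyramid can be reproduced with a short calculation. The sketch below assumes that each pyramid level holds a quarter of the pixels of the level above it (subsampling by two in both directions, which is an assumption here), so that the first level carries roughly 75% of the work; all names are illustrative.

```python
def first_level_share(levels):
    """Fraction of all pyramid pixels that sit in the first (full-resolution) level."""
    pixels = [0.25 ** k for k in range(levels)]   # relative size of level k
    return pixels[0] / sum(pixels)


FILTERS_PER_LEVEL = 5     # three expands plus two low pass filterings (from the text)
MULTS_PER_FILTER = 3      # per pixel, from the text
ADDS_PER_FILTER = 5       # per pixel, from the text

first_mults = FILTERS_PER_LEVEL * MULTS_PER_FILTER   # first pyramid level only
first_adds = FILTERS_PER_LEVEL * ADDS_PER_FILTER

share = 0.75                                         # rounded share used in the text
total_mults = first_mults / share
total_adds = first_adds / share

print(first_level_share(6), total_mults, round(total_adds))
```

With more levels the share computed by `first_level_share` approaches 3/4, the limit of the geometric series 1 + 1/4 + 1/16 + ... = 4/3.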


5.3 Total computational complexity

The three separate contributions to the computational complexity (affine
transformation, bilinear interpolation and pyramid fusion) add up to 42 additions,
26 multiplications and 4 image data accesses per pixel. For fusion of 768×512
image streams at 25 frames/second, we have an incoming pixel rate from each
sensor of about 10 Mpixel/second. This brings the complete computational budget
near 1 Gflops. This is just within reach of the newest DSP boards, such as the
Matrox Genesis with multiple Texas Instruments TMS320C80 DSPs.
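
The arithmetic behind this estimate can be sketched as follows. The sketch assumes that the per-pixel counts from the previous sections apply once per output pixel and counts additions and multiplications only; memory accesses, address arithmetic and loop overhead are left out, so the result is a lower bound of the same order of magnitude as the budget quoted above. The variable names are illustrative.

```python
WIDTH, HEIGHT, FRAME_RATE = 768, 512, 25   # image format and frame rate from the text

ADDS = 2 + 7 + 33      # incremental affine + bilinear interpolation + pyramid fusion
MULTS = 0 + 6 + 20
ACCESSES = 4           # image data look-ups for bilinear interpolation (not counted below)

pixel_rate = WIDTH * HEIGHT * FRAME_RATE          # pixels per second per sensor
ops_per_second = (ADDS + MULTS) * pixel_rate      # arithmetic operations only

print(f"{pixel_rate / 1e6:.2f} Mpixel/s, {ops_per_second / 1e9:.2f} Gops/s")
```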


TNO report FEL-97-B046


6.  Conclusions  and  recommendations 


In this chapter we present some conclusions drawn from the material presented in
the previous chapters. At the end of this chapter we give recommendations for
the application of this knowledge, as well as new research directions.

In chapter two a theoretical discussion of optimal fusion is presented. It is
demonstrated that for dependent images blending is the optimal solution, and that
for independent images a select-best strategy should be followed. The Burt method
is described in detail as an example of a select-best strategy; in addition, a
derivation is presented of how images with different signal-to-noise ratios should
be combined.

In  chapter  three  the  basics  of  two  methods  leading  to  colour  output  images  are 
presented. 

Chapter four presents some results for three image sequences with the methods
introduced earlier. From these images, one can deduce that no decisive advantage
can be demonstrated for any one of the blend, Burt and false colour images. It must
be noted that the images presented in this report were selected for inclusion; in
realistic scenarios quite often one of the two inputs will be superior to the other.

Chapter  five  gives  a  theoretical  overview  of  the  computational  aspects  for  image 
fusion.  The  main  conclusion  drawn  in  that  chapter  is  that  real-time  application  of 
image  fusion  with  a  pyramidal  approach  such  as  the  Burt  method  is  just  within  the 
capabilities  of  the  newest  DSP  boards. 


6.1  Recommendations  for  using  image  fusion 

• When a platform is equipped with multiple imaging sensors, make image fusion
available as an option. As shown, there are scenarios where fused images show
more information than each of the input images separately.

•  When  a  platform  is  equipped  with  multiple  imaging  sensors,  and  these  sensors 
are  sequentially  displayed  on  the  same  screen,  apply  image  warping  to  the 
images  even  when  no  explicit  image  fusion  is  applied.  This  will  guarantee  that 
when  switching  between  images  the  location  of  the  objects  remains  the  same. 


6.2 Recommendations for further research

•  The  examples  indicate  that  colour  image  output  is  valuable.  Further  research  is 
needed  to  determine  the  improvement  which  can  be  reached  with  colour 
displays  over  monochrome  displays,  and  whether  the  improvement  is  enough 
to  compensate  for  less  sharpness  in  colour  displays  due  to  colour  masks. 

• The flicker of the Burt images is caused by different temporal noise
realisations for separate frames in the sequence. Further research could
investigate the possibility of reducing this flicker by using knowledge of how
the decisions were made for previous frames.

• The target recognition example indicates that the use of multiple imaging
sensors improves the target recognition rate. With target recognition, it is not
necessary that all sensors used have the same modality and/or input rate. This
opens the possibility of integrating radar and laser range sensors with imaging
sensors. The use of target tracking might lead to higher signal-to-noise ratios
due to suppressed temporal noise.

• Both image fusion and image enhancement are promising techniques to
improve the perception of the scene by the operator. A combination of those
techniques seems promising. Future research should provide insight into the
relation between image fusion and image enhancement, and under which
operational circumstances which technique, or combination of techniques,
should be utilised.

•  As  it  is  unlikely  that  for  all  operational  circumstances  an  optimal  algorithm 
exists,  a  scheme  must  be  designed  to  select  the  best  algorithm  given  the  known 
operational  variables.  This  should  include  detection  of  sensor  jamming  and 
evading  its  artefacts. 


Appendix A Target recognition experiment

The sequence with target cueing was evaluated; an example from this sequence is
depicted in figure 4.1. Each of the twenty-six images in the sequence was evaluated
visually. In that evaluation it was decided whether each object found was a true
object or a false alarm. The type of alarm was also noted: in the IR image only,
in the visual image only, or in both channels. The results of this evaluation are
shown in table A.1.


Table A.1: Recognition rates for all frames.

Columns, in order: Correct (IR, Vis., Comb.) and Wrong (IR, Vis., Comb.).
For each frame only the non-empty cells are listed, in column order.

Frame   Non-empty cells
1       1  1  1
2       2  1  4
3       1  3
4       1  1  2
5       1  1  2  1
6       2  1  1
7       2  3  1
8       2  2  3
9       2  4  2
10      2  3  2
11      1  1  1  4  1
12      1  1  6
13      1  1  4
14      2  2  5
15      2  1  1
16      2  2  3
17      2  3  1
18      1  2  1  5
19      2  2  5
20      2  1  5
21      2  2  3
22      2  1  6  1
23      2  3  3  1
24      1  1  3  2
25      1  1  4
26      1  1  2  2

          Correct               Wrong
          IR   Vis.  Comb.      IR   Vis.  Comb.
Totals     3     7     41       41    78      4

The totals given in table A.1 can be used to determine the detection quality of
using IR only, visual only, or the combination of both. First, we determine
the cumulative number of objects present in the sequence; this is 3+7+41=51
objects. Of these 51 objects, 3+41=44 can be seen in the IR images, which
is 86%. In the visual images 7+41=48 objects can be seen, which is 94%. In the
combination 41 objects are recognised, which is 80%. Second, we determine the
false alarm rates. For IR, this is (41+4)/51=88%. For visual images, this is
(78+4)/51=161%. For the combination, this is 4/41=10%.


Table A.2: Detection and false alarm rates.

                    infrared   visual   combination
detection rate         86%       94%        80%
false alarm rate       88%      161%        10%
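
The rates in table A.2 follow directly from the totals of table A.1. The sketch below repeats that calculation; the dictionary keys and variable names are illustrative, and the false alarm rate of the combination is normalised by the 41 combined detections, as in the text.

```python
# Totals from table A.1.
correct = {"ir_only": 3, "vis_only": 7, "both": 41}
wrong = {"ir_only": 41, "vis_only": 78, "both": 4}

objects = sum(correct.values())   # cumulative number of objects in the sequence

detection = {
    "infrared": (correct["ir_only"] + correct["both"]) / objects,
    "visual": (correct["vis_only"] + correct["both"]) / objects,
    "combination": correct["both"] / objects,
}
false_alarm = {
    "infrared": (wrong["ir_only"] + wrong["both"]) / objects,
    "visual": (wrong["vis_only"] + wrong["both"]) / objects,
    # The text divides the combined false alarms by the 41 combined detections.
    "combination": wrong["both"] / correct["both"],
}

for name in detection:
    print(f"{name:12s} detection {detection[name]:.0%}  false alarm {false_alarm[name]:.0%}")
```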

For this single experiment this means that, when both sensor signals are used,
accepting a slight decrease in detection rate results in a very large decrease in
the number of false alarms. More realistically, one strives for a fixed false alarm
rate for all methods, which can be achieved by varying the thresholds used.
Although it cannot be concluded from the results given here, experiments have
shown that with such a fixed false alarm rate the combination of both sensors
results in the highest detection rate.